90 ◾ Bioinformatics
The coverage describes how often, on average, a reference sequence is covered by bases
from the reads. The sequencing depth is defined as the total number of reads that align to
or cover a specific locus (Figure 3.1).
The de novo assembly strategy has been greatly enhanced by the single-molecule real-
time (SMRT) sequencing (Pacific Biosciences (PacBio)) and nanopore sequencing (Oxford
Nanopore Technologies (ONT)), which produce long reads. In general, the de novo genome
assembly depends on the read length, sequencing depth, and the assembling algorithm.
Based on the sequencing technology used, reads can be classified into short reads, which
are produced by the most sequencing technologies including Illumina, and long reads pro-
duced by PacBio and ONT. In general, either sequence alignment or graphs approach is
used for the de novo assembly. The algorithms used by the de novo genome assemblers can
be classified into (i) greedy approach, (ii) overlap-layout-consensus with Hamiltonian path,
or (iii) de Bruijn graph with Eulerian path. The last two approaches share the idea that a
genome assembly can be represented as a path in graphs.
3.1.1 Greedy Algorithm
The greedy algorithm was the first one used for de novo genome assembly. Like multi-
ple sequence alignment, it depends on the similarity between reads to create a pileup of
aligned sequences, which are collapsed to create contigs from the consensus sequences. The
greedy algorithm uses pairwise alignment to compare all reads to identify the most simi-
lar reads with sufficient overlaps and then it joins them together. The reads with the sig-
nificant overlaps are merged. The algorithm continues by conducting pairwise alignments
on the merged reads to identify the sequences with significant overlaps and merges them.
This process will continue repeatedly until there is no more merging. When there is low
sequencing depth, gaps may be present between the assembled contigs; deep sequencing
and the use of paired-end reads can reduce gaps and allow longer contigs. Greedy assem-
blers include TIGR assembler [2] and PHRAP [3]. For instance, assume that we have the
reads CATTC, ATTCG, GTCAT, TTCGC, and GGCAT. Figure 3.2 shows how the greedy
algorithm works to assemble the consensus sequence GTCATTCGCAT from those reads.
3.1.2 Overlap-Consensus Graphs
The overlap graph method was first used for assembling genomes from sequences produced
by Sanger sequencing method and then it was extended by some assemblers for NGS reads.
FIGURE 3.1 Sequencing coverage and depth.